Screenshot
Nutch Web Interface Search |
|
Developer(s) | Apache Software Foundation |
Stable release | 1.4 / December 26, 2011 |
Development status | Active |
Written in | Java |
Operating system | Cross-platform |
Type | Search Engine |
License | Apache License 2.0 |
Website | nutch.apache.org |
Nutch is an effort to build an open source web search engine based on Lucene Java for the search and index component.
Contents |
Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June, 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project has also implemented a MapReduce facility and a distributed file system. The two facilities have been spun out into their own subproject, called Hadoop.
In January, 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April, 2010, Nutch has been considered an independent, top level project of the Apache Software Foundation.[1]
Some of the advantages of Nutch, when compared to a simple Fetcher
IBM Research studied the performance[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[4] Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.
The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[5]
|